I.  Introduction


The  goal  of  the  research  is  to  apply  analytical 
approximation  techniques  to  the  problem  of  practically  evaluating 
fault-tolerant  control  system  reliability  and  availability  where 
the  system  behavior  is  modelled  by  a  finite  state  Markov  or 
semi-Markov  process.  The  key  property  of  fault-tolerant  control 
systems  to  be  exploited  is  that  component  failures  tend  to  occur 
very  infrequently  relative  to  decisions  by  the  redundancy 
management  (RM)  system,  which  include  false  detection  alarms, 
detections  of  faults,  identification  of  faulty  components  and 
rejection  of  false  alarms.  This  property  tends  to  cause  the 
resulting  Markovian  model  to  exhibit  behavior  in  two  (or  more) 
distinct  time  scales:  a  fast  time  scale  for  the  RM  decisions  and 
a  slow  time  scale  for  the  component  failures.  Of  interest  for 
reliability  and  availability  studies  is  usually  the  behavior  that 
occurs  over  durations  intermediate  to  the  two  time  scales.  Under 
certain  conditions,  this  behavior  can  be  approximated  by  an 
aggregated  model  whose  aggregated  state  classes  reflect  primarily 
the  number  of  component  failures.  Therefore,  the  RM  decision 
behavior  is  considered  to  have  reached  steady  state 
instantaneously  in  the  time  scale  of  interest.  The  advantage  of 
an  aggregated  model  is  that  it  includes  only  a  fraction  of  the 
number  of  states  in  the  original  model.  It  is  therefore  much 
more  amenable  to  practical  computation  than  the  original  model, 
which  is  computationally  intractable  even  for  simple  systems. 


The approach to developing simplified models is based
primarily on the approximate aggregation theory developed in
[1,2].  The  primary  result  from  this  development  is  summarized  by 
the  following  theorem: 

Theorem: Given a perturbed finite-state semi-Markov process
$z^\epsilon(t)$ whose transition operator elements
$P^\epsilon_{ij}(t)$ have the following dependence on $\epsilon$:

$P^\epsilon_{ij}(t) = (p_{ij} - \epsilon q_{ij})\, h_{ij}(t/\epsilon)$   if $i, j \in E_k$   (1a)

$P^\epsilon_{ij}(t) = \epsilon q_{ij}\, h_{ij}(t/\epsilon)$   if $i \in E_k,\ j \notin E_k$   (1b)

with $\sum_{j \in E_k} p_{ij} = 1$ and where $p_{ij}$ and $q_{ij}$
are of order 1 and where the set of classes $\{E_k\}_{k=1}^{m}$ is
disjoint and exhaustive. If the Markov chains defined by the
$p_{ij}$'s within a single class $E_k$ represent an ergodic Markov
process with stationary state probability distribution
$\{\pi_i^{(k)}\}$ for each $k$ ($1 \le k \le m$), then:

$\lim_{\epsilon \to 0}$ [Eqns. (2)-(5): limit statement and
accompanying definitions not legible in the source]

and where $\tau_{ij}$ is the mean holding time for the holding time
density $h_{ij}(t)$.

The  proof  of  this  theorem  appears  in  [1]  and  some  extensions  and 
related  results  appear  in  [2]. 

Some  remarks  on  the  theorem  are  in  order: 

1.  Models of fault-tolerant systems have structures similar
to the conditions of the theorem in that the $p_{ij}$
(unperturbed within-class embedded transition
probabilities) are of order 1 and the embedded transition
probabilities out of each class are $\epsilon$-dependent and
usually linear in $\epsilon$, where $\epsilon$ is related directly
to the component failure rates. (However, see Remarks 3 and 4.)

2.  The usefulness of the theorem stems from the fact that it
provides an approximate description of the slow
class-to-class transition dynamics over a time duration
on the order of $1/\epsilon$ in terms of a finite state Markov
process with only as many states as there are classes.
These three properties (durations of order $1/\epsilon$, small
number of states and standard Markovian behavior) are all
desirable for fault-tolerant system evaluation
calculations. Once the interclass behavior has been
approximated in this way, the individual state
probabilities can be approximated as:

Prob {occupy state $i$ at time $t$}
   $\approx \pi_i^{(k)} \cdot$ Prob {occupy class $k$ at time $t$},  $i \in E_k$   (6)

where the approximate model provides the probability on
the right-hand side.

3.  Unfortunately, fault-tolerant control system models tend
not to have ergodic classes $E_k$. In particular, systems
which do not include mechanisms for on-line recovery of
components declared previously to have failed yield
evaluation models that include trapping states and
transient states in most of the classes when $\epsilon$ is zero.
For example, consider a system whose RM logic calls for
permanent shutdown of a component upon declaration of its
failure and which is subject to false alarms. One of the
states in a model for such a system is characterized by
all of the components working save one which has been
shut down by a false alarm. Assuming zero probability for
component failures (i.e. $\epsilon = 0$) and neglecting further
false alarms, this state becomes a trapping state in the
same class as the "all working" state. Therefore, the
"all working" state is a transient state and the class is
not ergodic for $\epsilon = 0$.

4.  Also unfortunately, the holding time density functions
appearing in models of fault-tolerant systems tend not to
have the dependence on $\epsilon$ exhibited in Eqn. (1). Consider
the meaning of $h_{ij}(t/\epsilon)$ for very small $\epsilon$: if
$h_{ij}(t)$ is a typical unimodal density over $[0, \infty)$
(such as exponential, Erlang, gamma) with mean and maximum
location both of order 1 in $t$, then $h_{ij}(t/\epsilon)$
approaches an impulse function in $t$ as $\epsilon \to 0$. In
fault-tolerant system models, $\epsilon$ represents the component
failure rate while $h_{ij}(t)$ is typically determined by the
distribution of the time to decision for the appropriate RM test
(particularly when $i$ and $j$ are both elements of the same
class). Thus, $h_{ij}(t)$ is typically not dependent on the
failure rate although it does typically have its mean and
its maximum location at small values of $t$ relative to the
length of a mission.

Our  research  to  the  present  has  been  directed  toward 
applying  the  results  of  the  Theorem  stated  above  to  two 
typical  fault-tolerant  control  system  models.  In  light  of 
Remarks  3  and  4,  much  of  our  recent  work  has  been  directed 
toward  modifications  of  the  results  of  the  Theorem  and 
investigation  of  the  effects  of  violating  some  of  the 
conditions  stated  in  the  Theorem.  The  next  Section  briefly 
describes  our  progress  in  these  areas. 

II.  Progress Summary

Our work began with a careful examination of [1] and [2]
with regard to the import of the ergodic assumption discussed
in Remark 3 above. This examination revealed that the proof
of the Theorem above depended explicitly on the existence for
each class $E_k$ of the inverse operator $[I - P_k + \Pi_k]^{-1}$
where:

$I$ = identity operator

$P_k$ = transition probability operator for the embedded
Markov process describing transitions within class
$E_k$ after $\epsilon$ is set to zero (hence eliminating
out-of-class transitions)

$\Pi_k$ = steady state transition operator associated with $P_k$,
if it exists
   = $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} P_k^i$ otherwise

As is stated in [2], if $E_k$ is an ergodic class when
$\epsilon = 0$ then $[I - P_k + \Pi_k]^{-1}$ is guaranteed to exist.
Hence, the ergodicity of $E_k$ is a sufficient condition for the
existence of $[I - P_k + \Pi_k]^{-1}$, which in turn is a
sufficient condition for the Theorem. However, ergodicity is not
necessary for the existence of the inverse operator. As a
particular example, consider a class consisting of all transient
states except for a single trapping state where all of the
transitions exiting the transient states enter the trapping
state. Then $P_k$ contains a row of ones in the position of the
trapping state and is otherwise filled with zeroes. Then $\Pi_k$
has the same form. Therefore,
$[I - P_k + \Pi_k]^{-1} = [I]^{-1} = I$, hence the inverse
operator exists. We have therefore proven the following:

Proposition: The results of the Theorem above are true if
the Markov chain defined by the $p_{ij}$'s for classes $\{E_k\}$
has either of the following properties: 1) it is ergodic
with stationary state probability distribution $\{\pi_i^{(k)}\}$
(independent of the initial condition), or 2) the inverse
operator $[I - P_k + \Pi_k]^{-1}$ exists and a valid steady state
probability distribution $\{\pi_i^{(k)}\}$ can be found such that
$\Pi_k$ operating upon it reproduces it (and it may be
dependent on the initial condition, which must then be
known).

The fundamental importance of this proposition to
fault-tolerant system evaluation is clear from Remark 3.
Furthermore, it is relatively straightforward to numerically
calculate $\Pi_k$ from $P_k$ and to compute $\{\pi_i^{(k)}\}$ from
$\Pi_k$ and the given initial condition. It is then possible to
numerically evaluate the eigenvalues (or singular values) of
$I - P_k + \Pi_k$, which leads finally to an indication that the
approximate results of the Theorem hold (and also produces the
required steady state distribution $\{\pi_i^{(k)}\}$ for each $k$)
or to an indication that the results of the Theorem do not hold.
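
The numerical check described above can be sketched in a few
lines. What follows is a minimal illustration, not the FORTRAN
program mentioned later in this report: the function names, the
truncation length of the Cesaro average and the two small example
classes are all assumptions made for the illustration. Matrices
are column-stochastic (entry [i][j] is the probability of moving
to state i from state j), which matches the "row of ones"
description of the trapping-state example.

```python
# Sketch: form the Cesaro limit Pi_k of a within-class matrix P_k and
# test whether I - P_k + Pi_k is invertible via its determinant.
# Convention (assumed): p[i][j] = Prob(next state i | current state j).

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cesaro_limit(p, n_terms=2000):
    """Approximate lim (1/n) * sum_{i=1..n} P^i by a finite average."""
    n = len(p)
    power = [row[:] for row in p]
    total = [[0.0] * n for _ in range(n)]
    for _ in range(n_terms):
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
        power = mat_mul(power, p)
    return [[total[i][j] / n_terms for j in range(n)] for i in range(n)]

def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def inverse_operator_check(p):
    """Return (det(I - P + Pi), Pi) for one class."""
    n = len(p)
    pi = cesaro_limit(p)
    m = [[(1.0 if i == j else 0.0) - p[i][j] + pi[i][j] for j in range(n)]
         for i in range(n)]
    return determinant(m), pi

# Non-ergodic class from the text: two transient states feeding one trap.
TRAPPING_CLASS = [[0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0]]

# Hypothetical ergodic two-state class, for comparison.
ERGODIC_CLASS = [[0.9, 0.2],
                 [0.1, 0.8]]

if __name__ == "__main__":
    for name, p in (("trapping", TRAPPING_CLASS), ("ergodic", ERGODIC_CLASS)):
        det, pi = inverse_operator_check(p)
        print(name, "det(I - P + Pi) =", det)
```

A determinant well away from zero indicates that the inverse
operator exists for that class; the columns of the computed
Cesaro limit then give the steady state distribution reached from
each initial state.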

We then proceeded to construct some typical models of
fault-tolerant systems in order to test the validity of the
results and to examine the conditions under which the
approximation fails. Two models of fault-tolerant system
behaviour have been developed by Wereley. Both are based on
the single-component dual-redundant (SCDR) system. This is
the simplest fault-tolerant configuration that may be
modelled. It consists of a primary component and a backup
component with an independent failure detection test
monitoring each. The RM logic is simply to use the primary
component until its test indicates that it has failed, at which
time a switch is made to the backup unless it is already
indicated to be failed.

The  first  model  is  assumed  to  have  a  sequential  detection 
test  for  each  component  with  decision  time  mass  functions  of 
the  hypergeometric  type.  No  recovery  from  false  alarms  is 
permitted.  This  particular  model  has  seven  states  (it  should 
be  noted  that  the  10-JAN-85  report  stated  incorrectly  that 
this  was  a  four  state  model)  which  decompose  into  three 
distinct  classes  as  the  probability  of  component  failure  in  a 
single  time  step  tends  to  zero.  Each  of  the  three  classes  is 
non-ergodic  due  to  the  existence  of  a  trapping  state  in  each. 
This  model  is  simple  enough  that  analytical  transform 
techniques,  as  well  as  numerical  computations,  may  be  used  to 
analyze  its  behaviour. 

The  second  model  is  also  assumed  to  have  a  sequential 
fault  detection  test  for  each  component  with  hypergeometric 
decision  time  mass  functions.  However,  this  model  includes  a 
false  alarm  recovery  (FAR)  test  which  is  triggered  by  a 
detection  indication.  This  FAR  test  is  simply  the  same 
sequential  test  as  for  detection  operating  on  the  indicated 
component  (that  is,  the  component  that  was  indicated  as  failed 
by the fault detection logic). The model has nine states
which again decompose into three distinct classes as the
probability of failure tends to zero. Each of the classes is
now ergodic due to the presence of the FAR test. An attempt
is underway to analyze this model by analytical transform
techniques with the aid of the MACSYMA symbolic manipulation
software package. However, the large number of states may make
symbolic manipulation impractical. In that case, numerical
computations will be used to determine the behaviour of this
model.

A FORTRAN program has been developed to numerically
describe the behaviour of the two models. Work is currently
progressing on both the transform analysis and the application
of the approximation techniques to these two relatively simple
models.

Meanwhile, work has also been started on a more complex,
more realistic fault-tolerant system model similar to the one
described in [3]. Kwong has constructed a model for a
fault-tolerant system with 3 redundant components, which employs
the Vector Shiryayev Sequential Test (VSST) (see [3]) to identify
and isolate failed components. The model is a 9-state
continuous parameter semi-Markov process. In the system, when
a component is isolated by the VSST, a self-test is initiated
and the component will be brought back into operation when
there are two consecutive no-failure indications from the
self-test. The semi-Markov model can be decomposed into 3
disjoint classes when the failure rate, $\epsilon$, of each
component is equal to zero. The classes are ergodic. Numerical
results for $\epsilon = 2.5 \times 10^{-6}$ failures per second
show that the normalized probabilities of occupying each state
within each class converge to the steady state probabilities of
the non-perturbed system for each class. Also, the steady state
probability distribution of the non-perturbed system can be
evaluated analytically. Therefore, the state probabilities of
the perturbed system can be approximated analytically if the
total probability within each class is known as a function of
time.
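
The approximation of Eqn. (6) can be illustrated on a toy
problem. The sketch below is an assumption-laden stand-in, not
the 9-state model: it uses a hypothetical three-state
discrete-time chain with one fast ergodic class (states 0 and 1)
and one trapping class (state 2), invented exit weights Q, and
unit holding times. It compares the exact state probabilities
with the product of the unperturbed stationary distribution and
the class probability from a two-class aggregated chain.

```python
# Toy illustration of Eqn. (6): state prob ~ pi_i * class prob.
# All numbers are hypothetical. Column-stochastic convention:
# p[i][j] = Prob(next state i | current state j).

P0 = [[0.5, 0.25],   # unperturbed within-class transitions (eps = 0)
      [0.5, 0.75]]
PI = [1.0 / 3.0, 2.0 / 3.0]  # stationary distribution of P0
Q = [1.0, 4.0]               # order-1 exit weights; exit prob = eps * Q[j]

def perturbed_matrix(eps):
    """Full 3-state chain: class {0, 1} leaks into trapping state 2."""
    p = [[0.0] * 3 for _ in range(3)]
    for j in range(2):
        stay = 1.0 - eps * Q[j]
        for i in range(2):
            p[i][j] = P0[i][j] * stay
        p[2][j] = eps * Q[j]
    p[2][2] = 1.0
    return p

def step(p, x):
    n = len(x)
    return [sum(p[i][j] * x[j] for j in range(n)) for i in range(n)]

def exact_probabilities(eps, n_steps):
    """Propagate the full chain from state 0 for n_steps steps."""
    p = perturbed_matrix(eps)
    x = [1.0, 0.0, 0.0]
    for _ in range(n_steps):
        x = step(p, x)
    return x

def aggregated_probabilities(eps, n_steps):
    """Two-class chain with averaged exit probability, then Eqn. (6)."""
    q_bar = sum(PI[i] * Q[i] for i in range(2))
    class_1 = (1.0 - eps * q_bar) ** n_steps
    return [PI[0] * class_1, PI[1] * class_1, 1.0 - class_1]

if __name__ == "__main__":
    eps, n_steps = 1e-4, 2000
    print("exact      ", exact_probabilities(eps, n_steps))
    print("aggregated ", aggregated_probabilities(eps, n_steps))
```

The aggregated calculation never touches the fast within-class
dynamics beyond their stationary distribution, which is the
source of the computational saving for larger models.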

The results of Korolyuk and Turbin's theorem can be
applied to obtain the approximate total probabilities within
each class when time is scaled by a time-scaling factor.
Results of the analytical calculation and of a complete
numerical calculation are compared in Table 1 in terms of the
total probability of occupying each class. As one would
expect, the results are in relatively close agreement, never
differing by more than about 10%. By examining other values
of $\epsilon$, it was empirically found that $\epsilon > 10^{-5}$
led to discrepancies in the results. It is of interest to note
that this is just two orders of magnitude smaller than the
smallest decision time rate assumed for the tests (see Table 1).


Table 1.  Exact Results vs. Approximate Results for Class
Probabilities: 9-State Model

Model parameters:

   Component failure rate:            2.5 x 10^-6 sec^-1
   False alarm decision rate:         10^-8 sec^-1
   Detection decision rate:           .05 sec^-1
   Recovery test false alarm rate:    .05 sec^-1
   Recovery test recovery rate:       .1 sec^-1
   Recovery test validation rate:     .1 sec^-1
   Recovery test miss rate:           .05 sec^-1

Probability of Occupying Class (Exact: top line; Approximate: bottom)

   Time (sec.)     Class 1      Class 2      Class 3

      40           .9997001     .0002999     3 x 10^-8
                   .9997000     .0002999     3 x 10^-8

     280           .9979038     .0020949     .0000013
                   .9979022     .0020963     .0000015

     600           .9955171     .0044769     .0000061
                   .9955101     .0044832     .0000067

    1200           .9910621     .0089137     .0000241
                   .9910404     .0089328     .0000269

    1600           .9881047     .0118525     .0000428
                   .9880717     .0118806     .0000477
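
The size of the discrepancy between the two rows of each table
entry can be checked with a short calculation. The sketch below
simply transcribes the Table 1 probabilities and reports the
relative difference of the approximate value from the exact one;
the dictionary layout and function names are illustrative
choices. Note that the smallest Class 3 entries are printed to
only one or two significant figures, so their percentage
differences are necessarily coarse.

```python
# Relative difference between exact and approximate class probabilities,
# using the values printed in Table 1.
# Layout (illustrative): time -> (exact row, approximate row).

TABLE_1 = {
    40:   ((0.9997001, 0.0002999, 3e-8),      (0.9997000, 0.0002999, 3e-8)),
    280:  ((0.9979038, 0.0020949, 0.0000013), (0.9979022, 0.0020963, 0.0000015)),
    600:  ((0.9955171, 0.0044769, 0.0000061), (0.9955101, 0.0044832, 0.0000067)),
    1200: ((0.9910621, 0.0089137, 0.0000241), (0.9910404, 0.0089328, 0.0000269)),
    1600: ((0.9881047, 0.0118525, 0.0000428), (0.9880717, 0.0118806, 0.0000477)),
}

def relative_differences(exact, approx):
    """|approx - exact| / exact for each class."""
    return [abs(a - e) / e for e, a in zip(exact, approx)]

def worst_case(table):
    """Return (largest relative difference, time, class number)."""
    worst = (0.0, None, None)
    for time, (exact, approx) in sorted(table.items()):
        for k, diff in enumerate(relative_differences(exact, approx), start=1):
            if diff > worst[0]:
                worst = (diff, time, k)
    return worst

if __name__ == "__main__":
    for time, (exact, approx) in sorted(TABLE_1.items()):
        diffs = relative_differences(exact, approx)
        print(time, ["%.2e" % d for d in diffs])
    print("worst case:", worst_case(TABLE_1))
```

The Class 1 and Class 2 probabilities agree to well under one
percent at every tabulated time; the visible differences are
confined to the very small Class 3 probabilities.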

We have also constructed a four-state model for the
purpose of examining the impact of nonergodicity on the
results. The basic four-state model involves "fast"
transitions between states 1 and 2 with "slow" transitions
occurring between these states and the remaining two states.
All transitions are semi-Markov in nature with second-order
hyperexponential holding time pdf's. Several variations of
this model are currently under study, some with ergodic
classes and some with nonergodic classes. The results are now
being generated and should be available late in the Fall.

We have also partially completed an analytical effort to
circumvent the difficulty discussed in Remark 4 above. This
involves the introduction of a second small parameter into the
description of the process as a time-scaling parameter.

Recall from Eqn. (1) that in order to apply the results of
Korolyuk's theorem or our proposition, the transition kernel
elements of the perturbed process had to take the form:

$p^\epsilon_{ij}\, h_{ij}(t/\epsilon)$

where $p^\epsilon_{ij}$ is proportional to $\epsilon$ for
interclass transitions but is asymptotically (as
$\epsilon \to 0$) independent of $\epsilon$ for intraclass
transitions. Consider now replacing this form by:

$p^\epsilon_{ij}\, h_{ij}(t/\delta)$

where $p^\epsilon_{ij}$ is the same as before. The asymptotic
results of the theorem remain true if $\delta$ is such that
$\delta = C\epsilon$ with $0 < C < \infty$. This can be seen from
Korolyuk's proof [1]. Note that $\delta$ is now a time-scaling
parameter which is small. Now consider finding asymptotic results
as $\epsilon$ and $\delta$ approach zero separately. In effect,
this is what we have done with our modification. We do not have a
rigorous proof as yet that the results are still asymptotically
true (except for the case cited above), but the empirical results
so far have all been supportive. We will expand on this topic in
the coming months.

III.  Papers  and  Presentations  Derived  from  This  Work 

A  presentation  was  made  at  the  following  Workshop: 

AFOSR  Workshop  on  Reliability 
Skyland  Lodge 

Shenandoah  National  Park,  Virginia 
May  28-31,  1985 

The abstract and viewgraphs from this presentation appear in
Appendix I.

A  brief  reference  to  this  work  is  contained  in  a  paper 
presented  at  a  Conference  in  July: 

B.K. Walker & O.K. Gerber, "Evaluation of Fault-Tolerant
System Performance by Approximate Techniques", Proc. of
7th IFAC Symp. on Identification & System Parameter
Estimation, York, UK, July 1985.

A  copy  of  this  paper  appears  in  Appendix  II. 


IV.  Projections  for  Second  Year  of  Work 

Our  efforts  will  continue  on  the  models  discussed  in  this 
report.  Of  particular  interest  will  be  the  results  for  models 
with  nonergodic  classes  and  the  efforts  to  employ  two  small 
parameters  in  the  description  of  the  model.  Furthermore,  we 
expect  to  be  able  to  analyze  some  of  the  smaller  models 
analytically  using  a  symbolic  manipulation  program  called 
MACSYMA.  This  will  allow  us  to  derive  closed  form  transform 
solutions  for  the  smaller  models  which  can  be  examined  easily 
for their asymptotic properties as the failure rate parameter
becomes small.

We  also  plan  to  construct  a  more  realistic  model  similar 
to  the  nine-state  model  cited  above.  This  model  will  use 
actual  holding  time  pdf  data  derived  from  simulations  of 
sequential  fault  diagnosis  tests  such  as  the  VSST.  This  will 
provide  the  information  necessary  to  apply  the  approximate 
results  that  we  have  derived  (or  modified)  to  a  realistic 
fault-tolerant  system  model. 

In  light  of  the  many  questions  that  have  arisen  as  part 
of  our  inquiries  (particularly  regarding  the  rigorous 
justification  for  some  of  our  analytical  results),  we 
anticipate  the  need  for  further  work  and  hence  we  plan  to 
submit  a  proposal  to  continue  this  work  beyond  next  summer. 


V.  Financial & Manpower Status

The  manpower  remains  as  it  was  proposed.  Prof.  Bruce  K. 
Walker  devotes  approximately  20%  of  his  time  to  the  project, 
primarily  in  a  supervisory  capacity.  Two  graduate  students, 
Siu-Kwong  Chu  and  Norman  M.  Wereley,  work  as  full-time 
graduate  research  assistants  on  the  project.  Margaret  McCabe 
devotes  approximately  10%  of  her  time  to  the  project  for 
clerical  support.  No  changes  are  anticipated. 

With  regard  to  the  financial  status,  a  substantial  cost 
underrun  occurred  in  the  first  year  of  the  project  primarily 
due  to  the  timing  of  the  project.  A  proposal  will  be 
submitted  to  carry  over  the  leftover  funds  into  the  second 
year.  Any  significant  changes  proposed  for  the  second  year  of 
funding  will  accompany  that  proposal. 

These three presentations will summarize our work to date on decomposition
methods applied to Markovian models of fault-tolerant system behavior. Such
models are very useful as design tools for the evaluation of the
reliability and performance of various fault-tolerant system designs.

First, we shall present the concept of modelling fault-tolerant system
behavior by Markovian models. We shall discuss the construction of Markov
models for systems which use on-line fault diagnostic tests of the "single
sample" variety. This will illustrate the generality of this modelling
method and the useful performance results which can be generated by such
models. It will also illustrate some of the practical problems that arise
when complex systems are considered. We shall then discuss the extension
of these modelling techniques to systems which use "sequential" on-line
diagnostic tests. This requires the generalization of the modelling
technique to include semi-Markovian models. It leads to further
applicability of the modelling method and also to further practical
problems for complex systems.

Next, we shall present as an illustrative example the relatively simple
case of a single dual-redundant component with on-line diagnostics which
are used to implement a primary/backup operating strategy. The Markov
model for this system will be presented and reliability results will
demonstrate that, as the component mean time to failure (MTTF) becomes large
relative to the time increment between fault diagnosis tests, a
decomposition of the model becomes apparent. A semi-Markov model will then
be developed for this system and similar results will be presented. We
shall then discuss our efforts to generate analytical results based upon
the decomposition of this model when the holding time densities of the
semi-Markov model are of a particularly simple yet relevant form (namely
hypergeometric of order 2).

Finally, we shall review our progress since last fall on our efforts to
apply the analytical results of Korolyuk and Turbin to our models. We
shall discuss the points at which models of fault-tolerant systems violate
the sufficient conditions for application of those results and how the
conditions can be relaxed or modified. We shall also present some
numerical results that indicate the need for extension of some of the
approximations. We shall discuss our approach to achieving such an
extension.

HAS THE FOLLOWING REDUNDANCY MANAGEMENT (RM) POLICY:


•  EACH INSTRUMENT HAS AN INDEPENDENT FAILURE
   DETECTION TEST

•  THE PRIMARY INSTRUMENT IS USED UNTIL A FAILURE
   OF THE PRIMARY IS INDICATED, IN WHICH CASE
   THE PRIMARY IS TURNED OFF AND THE BACKUP
   INSTRUMENT IS USED

•  THE FAILURE DETECTION TESTS ARE TURNED OFF
   AFTER THE FIRST INDICATED FAILURE

•  THE SYSTEM WORKS IF A WORKING INSTRUMENT IS
   BEING USED


A HOMOGENEOUS MARKOV MODEL WILL BE DEVELOPED FOR THE ABOVE
RM POLICY. NOTE THAT A FAILURE DETECTION DECISION IS AVAILABLE
AT EVERY TIME STEP.


STATE  DEFINITIONS 


1.  BOTH  INSTRUMENTS  WORKING 

2.  ONE INSTRUMENT WORKING, THE OTHER TURNED OFF DUE TO AN
INDICATED FAILURE

3.  BACKUP  COMPONENT  FAILED  UNCOVERED 

4.  SYSTEM  LOSS 


BOTH  THE  PROBABILITY  OF  FAILURE  OVER  ONE  TIME  STEP  Pf  AND  THE 
PROBABILITY  OF  FALSE  ALARM  OVER  ONE  TIME  STEP  Pfa  ARE  SMALL 
NUMBERS  -  TYPICALLY  10“*, 


SINGLE-STEP STATE TRANSITION PROBABILITY MATRIX


   | 1-2(Pf+Pfa)     0        0                   0 |
   | 2(Pfa+cPf)      1-Pf     c(1-Pf-Pfa)         0 |
   | (1-c)Pf         0        (1-c)(1-Pf-Pfa)     0 |
   | (1-c)Pf         Pf       Pf+Pfa              1 |
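
The use of this matrix for reliability calculations can be
sketched as follows. The matrix entries are as read from the
viewgraph, with rows as destination states and columns as origin
states; each column then sums to one, which supports that
reading. The numerical values of Pf, Pfa and c below are
hypothetical, and c is taken here to be the per-step probability
that a failure is detected, which the viewgraph does not define.

```python
# Sketch: propagate the four-state Markov model and read off reliability.
# States: 1 both working, 2 one working / other turned off,
#         3 backup failed uncovered, 4 system loss.
# Convention: p[i][j] = Prob(next state i+1 | current state j+1).

def transition_matrix(pf, pfa, c):
    """Single-step matrix as read from the viewgraph."""
    return [
        [1 - 2 * (pf + pfa), 0.0,      0.0,                       0.0],
        [2 * (pfa + c * pf), 1 - pf,   c * (1 - pf - pfa),        0.0],
        [(1 - c) * pf,       0.0,      (1 - c) * (1 - pf - pfa),  0.0],
        [(1 - c) * pf,       pf,       pf + pfa,                  1.0],
    ]

def column_sums(p):
    n = len(p)
    return [sum(p[i][j] for i in range(n)) for j in range(n)]

def propagate(p, x0, n_steps):
    """Apply x <- P x repeatedly, starting from state vector x0."""
    n = len(x0)
    x = list(x0)
    for _ in range(n_steps):
        x = [sum(p[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x

def reliability(p, n_steps):
    """Probability that system loss (state 4) has not occurred."""
    x = propagate(p, [1.0, 0.0, 0.0, 0.0], n_steps)
    return 1.0 - x[3]

if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    p = transition_matrix(pf=1e-4, pfa=1e-4, c=0.9)
    print("column sums:", column_sums(p))
    for n_steps in (100, 1000, 10000):
        print(n_steps, "steps: reliability =", reliability(p, n_steps))
```

Because state 4 is absorbing, the computed reliability can only
decrease as the number of time steps grows.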


NEW RM POLICY


•  EACH INSTRUMENT HAS AN INDEPENDENT SEQUENTIAL
   PROBABILITY RATIO TEST (SPRT) TO DETECT FAILURES

•  BOTH TESTS ARE RESET ON ANY NOMINAL INDICATION

•  DECLARE FAILURE OF AN INSTRUMENT WHEN ITS SPRT
   INDICATES A FAILURE

•  THE PRIMARY INSTRUMENT IS USED UNTIL A FAILURE OF
   THE PRIMARY IS INDICATED, IN WHICH CASE THE PRIMARY
   IS TURNED OFF AND THE BACKUP IS USED

•  THE FAILURE DETECTION TEST IS TURNED OFF AFTER THE
   FIRST INDICATED FAILURE

•  THE SYSTEM WORKS IF A WORKING INSTRUMENT IS BEING
   USED BUT IS ROBUST ENOUGH TO SUSTAIN A FAILURE FOR
   A "WHILE"


STATE DEFINITIONS FOR THE SEMI-MARKOV MODEL

1.  BOTH INSTRUMENTS WORKING

2.  ONE INSTRUMENT WORKING, THE OTHER TURNED OFF DUE TO A FALSE
ALARM

3.  ONE INSTRUMENT WORKING, THE OTHER TURNED OFF DUE TO A
COVERED FAILURE

4.  PRIMARY INSTRUMENT FAILURE - NO INDICATION YET

5.  BACKUP INSTRUMENT FAILURE - NO INDICATION YET

6.  SYSTEM LOSS

SEMI-MARKOV TRANSITION DIAGRAM


The interval transition probabilities matrix is generated by the
following equation,

NORMALIZED STATE PROBABILITIES TRAJECTORY


An Approximation (Korolyuk & Turbin, 1976)


Perturbed semi-Markov chain: $z^\epsilon(t)$ with $\epsilon$ a
small parameter.

State space partitions into $m$ classes $E_1, \ldots, E_m$
where kernel matrix is: [expression not legible in the source]

with stationary distribution $\{\pi_i^{(k)}\}$
(if each class is ergodic).

Then: As $\epsilon \to 0$

Prob $\{z^\epsilon(t) = i\} \sim \pi_i^{(k)} \cdot$ Prob $\{y(t) = k\}$   $(i \in E_k)$

where $y(t)$ is a Markov process representing
class-to-class transition behavior with
mean time to transition dependent on mean
holding times and $q_{ij}$'s.


Typical fault-tolerant system model:


•  Most models do not have ergodic classes


•  Ergodicity is sufficient but not necessary.

•  Also sufficient is existence of the inverse operator
   for each class:

   $[I - P + \Pi]^{-1}$


where:  $P$ is interval transition probability operator,

$\Pi$ is Cesaro limit of successive $P$ operations, i.e.

   $\Pi = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} P^i$

(generates stationary operator if a stationary
distribution exists)


•  Can check for this existence numerically.


Problem: $\epsilon$-dependence

Our first case:


Holding time pdf not dependent on $\epsilon$ (which corresponds
to failure rate in fault-tolerant system models).


(Yields Markovian behavior of $y$ in original
approximation.)


SEMI-MARKOV MODEL FOR THE CLASS TO CLASS PROCESS


Laplace transform of semi-Markov kernel for the process that
starts from state i in class $E_k$ and moves to $E_r$:
[expression not legible in the source]

Assume this transform is independent of superscript;
then one can deduce:
[expression not legible in the source]


WHAT FORM IS THIS TRANSFORM LIKE FOR THE MODEL?

[worked expressions not legible in the source], as $\epsilon \to 0$

[quantity not legible in the source] does not depend on $\epsilon$ (?)

MATCHING THE CONDITIONS STATED IN THE PAPER


Propose the semi-Markov process depending on two small parameters
$\epsilon$ and $\delta$:


$\epsilon$ :  class to class transition parameter

$\delta$ :  time scaling parameter


as $\epsilon$ and $\delta \to 0$:

*  the state space $E$ of the semi-Markov process can be split
   into disjoint classes of states $\{E_k\}$

*  the sojourn of the process in a given state tends to zero


Trying to validate the two small parameters method and deduce

CURRENT WORK


•  CONTINUE "SIMPLE" MODEL
   •  NONERGODIC CLASSES
   •  SMALL ENOUGH FOR ANALYTICAL TRANSFORMS
   •  FURTHER NUMERICAL RESULTS

•  USE TWO SMALL PARAMETERS TO EXTEND ANALYSIS OF
   LARGER MODEL
   •  HAS ERGODIC CLASSES

•  FURTHER EXAMPLES

EVALUATION OF FAULT-TOLERANT SYSTEM PERFORMANCE BY APPROXIMATE TECHNIQUES

Abstract. An approximate method for calculating the statistics of the performance of
a fault-tolerant system is developed. An approximate method is necessary because the
statistical model of the system behavior is large-scale and the time horizon of
interest encompasses many cycles of the Redundancy Management logic. In the
development, a compact representation of the necessary information called the
v-transform is introduced and discussed. Based upon this representation, an
approximation that leads to a very efficient computational procedure is suggested and
numerically analyzed. A very brief discussion of other related work is also
presented.

Keywords. System failure and recovery; reliability theory; Markov processes;
stochastic systems; numerical methods


I.  Introduction 

The use of imbedded microprocessors and other
computational devices in the implementations of
control system designs has given the designer of
such systems the freedom to synthesize very complex
control schemes. The motivation for using such
sophisticated designs is the significant
enhancement of the system performance which can be
obtained relative to designs which use very crude
control strategies. These sophisticated designs
often involve the use of many sensing and actuating
components in an integrated control scheme. The
components are often subject to failure or damage,
and it is often the case that the system
performance degrades dramatically or even becomes
unacceptable or unsafe when one or more of the
components ceases normal operation. Examples of
such systems include digital flight control systems
for statically unstable aircraft (such as the
X-29), the flight and engine control systems for
VTOL aircraft, the attitude and shape control
systems for large space vehicles, and the control
systems of nuclear power plants.

The fundamental importance of certain components to
the acceptable or optimal performance of the
control system has led to the incorporation of
redundancy and fault-tolerance into such systems.
Fault-tolerance may be achieved either by
replicating the hardware components which are
subject to faults or by implementing a system which
provides functional redundancy among its
components. In either case, the automatic control
system is then obliged to manage this redundancy by
monitoring the components for faults and selecting
the components to be used in real time. This
function of the automatic system is referred to as
Redundancy Management (RM). Its implementation can
be as simple as a passive signal selection scheme
from among replicated identical sensors, or as
complex as a sophisticated configuration selection
scheme based on automated logic which utilizes
elements of statistical decision theory.

The presence in the system implementation of a
Redundancy Management function lends a different
meaning to the concept of system performance. The
optimal design performance will only be achieved
(or approached) if all of the components remain
operational and the RM function performs
flawlessly. If either of these conditions is
violated, the system will in general perform less
than optimally. This suggests that the
"performance" of the system is not the optimal
performance that is attainable when everything is
working properly but rather is a random variable
which reflects the occurrence of random component
failures and random RM decision errors. The
statistical properties of this random
"performance" value are of great interest to the
designer of the system. It is the calculation of
these statistical performance properties with
which we shall concern ourselves in this paper.

The computational algorithms which result from this
analysis can be thought of as design tools for the
fault-tolerant control system designer.

Since component failures and RM decisions can both
be characterized as random events, one of the
primary steps in the development of a performance
evaluation method is the construction of a
stochastic model for those aspects of the behavior
of the system which govern the performance. There
exist two approaches to this modelling task: the
combinatorial method [1] and the method of
generalized Markovian models [2]. It has been
shown that the former method is far more unwieldy
than the latter when it is necessary to account
for the time ordering of the random events which
may take place during a mission [3, 4]. Since the
system performance may be impacted dramatically by
such time-ordered events, this makes the latter
method far more attractive. Henceforth, we shall
assume that the model to be dealt with is of the
generalized Markovian type, i.e. that the model is
a finite state Markov or semi-Markov process whose
states correspond to the various possible
combinations of failure events and RM decision
events that can occur. This paper will emphasize
discrete parameter models. Similar analyses hold
for continuous parameter models.

When generalized Markovian models are used for
performance evaluation of realistically complex
systems, a dimensionality problem arises. Complex
fault-tolerant systems tend to require many states
for their accurate characterization. Furthermore,
the operating time (or mission time) for such
systems tends to be long relative to the operating
cycle time of the RM system. Therefore, the
operating times of interest are such that the
model must be propagated for many RM cycle times.
A further dimensionality problem is engendered by
the fact that the system performance may be a
function of the entire history of failure and RM
decision events. These factors all combine to
produce an explosion of the memory size and the
number of computations required to evaluate the
system performance. Unfortunately, the
simplifications that are possible by using steady
state analysis of such models are not applicable
because the operating time of a fault-tolerant
system tends to be only a small fraction of the
mean time between failure events. Therefore, the
transient behavior of the model is of interest
while the steady state behavior is not.

In this paper, we discuss some techniques that are
currently under development that lead to
approximate results for performance evaluation.
First, we discuss a method for discrete parameter
Markovian models of fault-tolerant systems that
involves the introduction of a "performance
transform." By approximating the behavior of the
transform, it is possible to generate approximate
results for the probability mass function of the
random performance value. A means for implementing
this approximation is suggested which makes use of
an alternative evaluation of the expected value of
the performance. Subsequently, a method for
continuous parameter models is briefly discussed
which exploits the typical separation of time
scales between the failure event history and the
RM decision history.


2. Performance Transform Method

The behavior of many fault-tolerant system designs
can be captured by a finite state Markov process
with discrete time parameter. The states of such a
model represent the various operational states of
the fault-tolerant system. They are characterized
by the operational status of each of the
components and by the status of each of the
automatic fault diagnosis tests. For example, a
typical state in a model for a fault-tolerant
inertial measurement unit would be characterized
by the gyros and accelerometers which were still
working, those that had already failed, and those
that had been eliminated from use by the RM
function (note that the latter two sets need NOT
be identical) plus the status of all of the fault
detection and isolation tests which the RM logic
uses. If it can be assumed that the time of
failure for each component is exponentially
distributed (and hence is generated by a
memoryless process) and that each fault
diagnostics test operates only on instantaneous
data (and is therefore also memoryless), then the
various combinations of failure events and test
outcomes can be formed which represent transitions
of state for the system. If the probabilities of
these transitions can be derived, then the state
definitions and the transition probabilities taken
together constitute a Markov model for the
evolution of the system configuration. These
models have been used extensively in recent years
for the calculation of the reliability of
fault-tolerant systems [5, 6, 7, 8].


When the operational state of the system is such
that fewer than the nominal number of components
are being used or such that some of the components
in use are no longer operating normally, then the
system performance is degraded. Depending upon
the history of such non-nominal conditions, the
overall performance of the system in executing its
task will also suffer. Let s_k be the integer index
of the state occupied at time step k by a discrete
parameter Markov model of the system behavior.
Assume that J_k(s_k) is the contribution to the
overall system performance of occupying state s_k
at time step k and that these contributions are
cumulative so that the overall system performance
over a mission of k_m time steps is given by:

$$ \text{Cost} = \sum_{k=1}^{k_m} J_k(s_k) $$

Clearly, this overall performance value will be a
function of the time history of the operational
state (or OSH, for Operational State History) and
because each OSH is a sample function of a random
process, the performance value will be a random
variable. It is possible to compute the
probability of occurrence of each and every OSH
from the single-step transition probability matrix
P of the Markov model and the initial state
probability vector π_0, which is usually known and
frequently consists of unity for the probability
of initially occupying a state characterized by
all normal components and zeroes for all the other
initial state probabilities. Once the probability
of each OSH is known, the entire probability mass
function (pmf) of the performance value can be
constructed, and the problem is solved.
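The exhaustive construction just described can be sketched for a very small model. The following is only an illustration under stated assumptions: the 3-state transition matrix, the per-step performance values and the function name `exact_pmf` are hypothetical and do not come from the paper, and the matrix is stored so that each column sums to one (entry `P[i][j]` is the probability of moving to state i from state j).

```python
from itertools import product
from collections import defaultdict

# Hypothetical 3-state model; state 2 plays the role of a trapping
# system-loss state. P[i][j] = Pr(next state = i | current state = j).
P = [
    [0.90, 0.00, 0.0],
    [0.08, 0.85, 0.0],
    [0.02, 0.15, 1.0],
]
J = [0, 1, 0]   # assumed integer performance value per step in each state
START = 0       # assumed initial state (all components normal)


def exact_pmf(P, J, start, steps):
    """Enumerate every OSH of the given length and accumulate its probability
    under the total performance value it produces."""
    n = len(P)
    pmf = defaultdict(float)
    for path in product(range(n), repeat=steps):
        prob, prev, total = 1.0, start, 0
        for s in path:
            prob *= P[s][prev]   # single-step transition prev -> s
            total += J[s]        # performance incurred by occupying s
            prev = s
        if prob > 0.0:
            pmf[total] += prob
    return dict(pmf)


pmf = exact_pmf(P, J, START, steps=4)
```

The loop visits all n**steps candidate histories, which is exactly the growth that makes this direct approach impractical for realistic models.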


Unfortunately, the number of OSHs expands very
rapidly with elapsed time. If the model consists
of S states which form a single communicating
class, then the number of distinct OSHs may be as
large as S^k where k is the elapsed time since the
mission began. As was discussed in the
Introduction, the elapsed times of interest are
frequently large relative to the RM cycle time and
S itself is frequently large. As a result, the
number of distinct OSHs becomes unmanageably
large.


Fault-tolerant systems frequently have the
property that component repair is not feasible
during a mission and hence need not be considered.
In this case, the system configuration can only
degrade due to failures or incorrect RM decisions.
Also, all fault-tolerant system models include a
state that represents configurations which are so
degraded that they are unacceptable. This state is
the system loss (SL) state, and it is a trapping
state when repair is not possible. These
circumstances lead to a situation where the number
of distinct OSHs that a system can exhibit is not
exponential in the number of states. It is
sometimes possible to show that the number of
distinct OSHs is bounded by a linear function of
time. Nonetheless, even in the latter case, the
number of OSHs quickly grows to a value that is
beyond the memory capability of even large
mainframe computers. This motivates the search
for approximate methods to compute the statistics
of the system performance. We shall now present
such a method.


Consider a finite state Markov model of a
fault-tolerant system comprising N states, one of
which, namely s_N, is the SL trapping state.
Associated with the occupancy of each state for a
single time step is an integer-valued performance
measure. The assumption of integral values here
is not restrictive because a general performance
measure can be resolved to the integers by
discretizing its value. At this point, we shall
also assume that the performance values are
time-invariant. If they are time-varying, the
algebra becomes considerably more cumbersome, but
the results cited below hold except where the
assumption of time-invariance is explicitly
mentioned. If J(s_k) again represents the
performance value incurred by occupancy of state s_k
at time k, we have that:

$$ \text{Perf} = \sum_{k=1}^{k_m} J(s_k) $$

where k_m is the length of the mission expressed in
number of time steps. This performance value is
random because the OSH followed by the system is
random. Clearly, if we can calculate the
probability of each OSH that the system can
follow, then the characterization of the pmf of
the system performance value will be complete.

A typical OSH over k time steps takes the form:

$$ \{\, j,\ s_1,\ s_2,\ \ldots,\ s_{k-1},\ i \,\} $$

which is a list of the states occupied by the
system at each of the k time steps. Here, the
system initially occupies state j and it occupies
state i at the k-th time step. Suppose there are
t_ij(k) such OSHs, all beginning in state j and
ending in state i at the k-th time step but
traversing many different states in between. For
the l-th such OSH, let its probability be given by
p_ij(l,k) and the accumulated value of performance
be denoted J_ij(l,k). We define the performance
transform or v-transform for this set of OSHs as:

$$ m_{ij}(v,k) = \sum_{l=1}^{t_{ij}(k)} p_{ij}(l,k)\, v^{J_{ij}(l,k)} $$


The v-transform is a compact way of representing
the complete statistical characterization of the
performance of the system. Among its properties
are the following. If we set v to unity in the
v-transform, we obtain:

$$ m_{ij}(1,k) = \sum_{l=1}^{t_{ij}(k)} p_{ij}(l,k) $$

which is the probability of reaching state i from
state j in k time steps, i.e. the multistep
transition probability from state j to state i.

If we differentiate the v-transform with respect
to v and then set v to unity in the result, we
obtain:

$$ \left. \frac{\partial m_{ij}(v,k)}{\partial v} \right|_{v=1} = \sum_{l=1}^{t_{ij}(k)} J_{ij}(l,k)\, p_{ij}(l,k) $$

which is the expected value of the performance
after k time steps. This moment-generating
property of the v-transform extends to all higher
moments of the performance value as well.

Because the performance values associated with
occupancy of each state are integer-valued, the
exponents in the v-transform are integers.
Therefore, the v-transform is always a polynomial
in v. The v-transform representation of the
behavior of the performance value can therefore be
made even more compact by combining terms in the
polynomials. This procedure effectively merges
OSHs whose beginning and ending states are the
same and whose cumulative performance values are
identical. The properties cited above for the
v-transform remain in force after this combination
of terms.
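The properties above can be checked on a toy polynomial. This is a minimal sketch, assuming a v-transform is stored as a Python dictionary that maps each integer exponent to its coefficient; the three OSHs, their probabilities and the helper names are hypothetical.

```python
def merge_terms(oshs):
    """Combine (probability, accumulated performance) pairs that share an
    exponent, which merges OSHs with identical cumulative performance."""
    poly = {}
    for prob, perf in oshs:
        poly[perf] = poly.get(perf, 0.0) + prob
    return poly


def value_at_one(poly):
    """Setting v = 1 leaves the sum of the coefficients."""
    return sum(poly.values())


def derivative_at_one(poly):
    """d/dv of sum(A * v**b) evaluated at v = 1 is sum(b * A)."""
    return sum(b * a for b, a in poly.items())


# Three hypothetical OSHs from state j to state i: (p_ij(l,k), J_ij(l,k)).
oshs = [(0.5, 2), (0.25, 2), (0.125, 5)]
m_ij = merge_terms(oshs)                 # 0.75 v^2 + 0.125 v^5

transition_prob = value_at_one(m_ij)     # multistep transition probability
weighted_perf = derivative_at_one(m_ij)  # probability-weighted performance
```

Merging reduces three terms to two here without changing either quantity, which is the sense in which the properties remain in force after terms are combined.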

The matrix of v-transforms for all starting and
ending states is denoted M(v,k). Its propagation
in time is governed by the difference equation:

$$ M(v,k+1) = V_k(v)\, M(v,k) $$

where V_k(v) is the single-step v-transform update
matrix effective at time step k. V_k(v) is
constructed from P by multiplying each row of P by
v raised to the power of the performance incurred
by occupancy of the corresponding state for one
time step at time k. If these performance values
are time-invariant, then V_k(v) reduces to V(v) and
the difference equation becomes:

$$ M(v,k+1) = V(v)\, M(v,k) $$


The combination of terms described earlier can be
applied at each time step to reduce somewhat the
number of terms in the polynomials comprising
M(v,k). Note, however, that the problem of
keeping track of a large number of OSHs has not
been eliminated but merely converted into the
problem of keeping track of a large number of
polynomial terms.
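A minimal sketch of the time-invariant recursion follows, with like exponents combined at every step. It assumes each entry of the v-transform matrix is a dictionary from exponent to coefficient, that M(v,0) is the identity, and that `P[i][j]` is the probability of moving to state i from state j, so that entry (i,j) of V(v) is `P[i][j] * v**J[i]`. The 3-state model and all names are hypothetical.

```python
# Hypothetical 3-state model; state 2 is a trapping system-loss state.
P = [
    [0.90, 0.00, 0.0],
    [0.08, 0.85, 0.0],
    [0.02, 0.15, 1.0],
]
J = [0, 1, 0]   # assumed per-step performance of each state


def identity(n):
    """M(v,0): the constant polynomial 1 on the diagonal, 0 elsewhere."""
    return [[({0: 1.0} if i == j else {}) for j in range(n)] for i in range(n)]


def step(M, P, J):
    """One application of M(v,k+1) = V(v) M(v,k) on polynomial entries."""
    n = len(P)
    new = [[{} for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = new[i][j]
            for m in range(n):
                if P[i][m] == 0.0:
                    continue
                for b, a in M[m][j].items():
                    e = b + J[i]                           # multiply by v**J[i]
                    acc[e] = acc.get(e, 0.0) + P[i][m] * a  # merge like exponents
    return new


def propagate(P, J, steps):
    """Apply the difference equation the requested number of times."""
    M = identity(len(P))
    for _ in range(steps):
        M = step(M, P, J)
    return M


M4 = propagate(P, J, 4)
terms_kept = sum(len(poly) for row in M4 for poly in row)
```

Here `terms_kept` counts the polynomial terms that must be stored after four steps; it is this count, rather than the number of raw OSHs, that grows with the mission length.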

By successively applying the difference equation,
we can generate the v-transform matrix M(v,k_m).
The v-transform of the performance of the system
assuming it started in state j and did NOT reach
system loss during the mission is then given by:

$$ W_j(v,k_m) = \sum_{i=1}^{N-1} m_{ij}(v,k_m) $$

Since it is frequently the case that the system is
known to begin the mission with all components
operating and no fault detection alarms, it is
often true then that the v-transform of the system
performance over the mission is given by W_1(v,k_m).
This v-transform completely represents the pmf of
the system performance, which was the desired
result. However, it still suffers from the memory
difficulties associated with keeping track of a
large number of polynomial terms in generating it.
An approximation will now be discussed that
circumvents this difficulty.
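Reading the pmf off the mission v-transform can be sketched as follows. The matrix below is a hypothetical 3-state example (not taken from the paper) in which entry `M[i][j]` maps exponent to coefficient for OSHs that start in state j and end in state i, and the last state is system loss.

```python
def mission_transform(M, start, loss_state):
    """Sum the v-transforms m_ij over every ending state i except the
    system-loss state, for a fixed starting state j = start."""
    W = {}
    for i, row in enumerate(M):
        if i == loss_state:
            continue
        for b, a in row[start].items():
            W[b] = W.get(b, 0.0) + a
    return W


# Hypothetical v-transform matrix at the end of a mission.
M = [
    [{0: 0.5}, {}, {}],
    [{1: 0.25, 2: 0.125}, {1: 0.75}, {}],
    [{0: 0.0625, 1: 0.0625}, {1: 0.25}, {0: 1.0}],
]

W_0 = mission_transform(M, start=0, loss_state=2)
survival = sum(W_0.values())   # probability of not reaching system loss
pmf_given_survival = {b: a / survival for b, a in W_0.items()}
```

The coefficients of `W_0` are joint probabilities of a performance value and of surviving the mission; dividing by `survival` is an optional normalization added here for illustration.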

Assuming once again that the performance values
are time-invariant, let r be a row vector of the N
performance values incurred by occupying each
of the N states for one time step. Let R(k) be
the row vector of expected performance after k
time steps starting from each of the N states of
the model. Then, the theory of Markov processes
with rewards [10] yields the following result:

$$ R(k) = r \sum_{n=1}^{k} P^n $$

Because state N is the SL trapping state, the
elements of R(k) all tend toward a steady state
asymptote which is linear with a slope of r_N.
Consider this value for a moment. It is the
performance value incurred for occupancy of the SL
state for a single time step. Note, however, that
to the system designer, the fact that the system
has reached the system loss state means that the
system is no longer capable of operating.
Therefore, its "performance" upon reaching this
level of degradation is irrelevant. Hence, the
value chosen for r_N is irrelevant except to the
behavior of R(k). In light of this fact, we
choose r_N equal to zero to avoid a steady state
increase in the values of R(k).
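The expected-performance recursion is inexpensive because it only carries one row vector forward. The sketch below accumulates r P^n for n = 1..k directly rather than by modal decomposition; the 3-state matrix, the reward vector and the function name are hypothetical, and `P[i][j]` is again taken to be the probability of moving to state i from state j.

```python
# Hypothetical 3-state model; state 2 is the trapping system-loss state.
P = [
    [0.90, 0.00, 0.0],
    [0.08, 0.85, 0.0],
    [0.02, 0.15, 1.0],
]
r = [0.0, 1.0, 0.0]   # per-step performance, with the SL entry set to zero


def expected_performance(P, r, steps):
    """Return R(k) = r * (P + P**2 + ... + P**k) as a list of length N."""
    n = len(P)
    row = list(r)                  # holds r * P**m after m multiplications
    total = [0.0] * n
    for _ in range(steps):
        row = [sum(row[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = [t + x for t, x in zip(total, row)]
    return total


R_short = expected_performance(P, r, 5)
R_long = expected_performance(P, r, 50)
```

With the SL reward set to zero the entry for the trapping state stays at zero, and the other entries level off instead of growing linearly.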

Let us consider again the interpretation of R(k).
The elements of R(k) are the expected values of
the total accumulated performance over k time
steps starting from each of the model states at
time 0. Since the system usually starts from
state 1, let R_1(k) be denoted J̄(k). This is the
expected performance over k time steps for ALL
OSHs beginning in state 1, including those which
end in state N, the SL trapping state. Note
again, however, that OSHs ending in the SL state
are not of interest in performance evaluation
(except in the computation of the system
unreliability). Therefore, J̄(k) can be decomposed
into two parts: the portion J̄_I(k) which is the
expected performance for those OSHs not ending in
the SL state and therefore of interest, and the
portion J̄_SL(k) which is the expected performance
accumulated by those OSHs ending in the SL state
and therefore not of interest. Figure 1
illustrates the relationship between these three
quantities for a typical example. Note that the
mission time k_m is typically short relative to the
time at which the expected performance behavior
approaches steady state.

With r_N set to zero, it is a relatively easy
matter to generate the elements of R(k) for any k
and, in particular, to generate R(k_m). This can
be done using modal decomposition [9] or any other
numerically well-behaved algorithm. R(k_m) can
then be used in the following approximation
scheme. Note that R_i(k_m) is an upper bound for
the expected performance accumulated over k_m time
steps beginning from state i and is therefore also
an upper bound for the expected performance to be
accumulated over n < k_m time steps beginning from
state i. Consider a time step k at which we have
generated the v-transform matrix M(v,k) whose
(i,j)-th element is m_ij(v,k), which in turn
comprises many terms of the form A v^b where we have
assumed that terms with like exponents have
already been combined. The approximation we shall
use is produced by neglecting all such terms in
m_ij(v,k) that are such that:

$$ A\,[\,b + R_i(k_m)\,] < \text{tolerance} $$

This has the effect of discarding all OSHs at time
k which are expected to have a small contribution
to the statistical properties of the performance
over the mission. Note that OSHs that have
accumulated only a small performance value up to
time k and have a small probability might still be
retained by this approximation if it is expected
that they will accumulate a large performance
value during the remainder of the mission. This
makes the approximation much less risky than
discarding all OSHs whose contribution to the
expected performance at time k is small without
regard to what their future contribution might be.
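The discard rule can be sketched as a filter applied to the v-transform matrix after each propagation step. The matrix, the bound vector and the tolerance below are hypothetical values chosen only to show which terms survive; `R_mission[i]` stands for R_i(k_m), the bound on future expected performance from ending state i.

```python
def prune(M, R_mission, tolerance):
    """Drop every term A * v**b of m_ij(v,k) for which
    A * (b + R_mission[i]) falls below the tolerance."""
    pruned = []
    for i, row in enumerate(M):
        pruned.append([
            {b: a for b, a in poly.items()
             if a * (b + R_mission[i]) >= tolerance}
            for poly in row
        ])
    return pruned


# Hypothetical 2-state example: M[i][j] maps exponent -> coefficient.
M = [
    [{0: 0.9, 3: 1e-6}, {}],
    [{1: 0.1, 40: 1e-6}, {0: 1.0}],
]
R_mission = [5.0, 2.0]

M_pruned = prune(M, R_mission, tolerance=1e-4)
```

In this example the term with exponent 0 and coefficient 0.9 is kept even though it has accumulated no performance so far, because its ending state still has a large expected future contribution, while both terms of probability 1e-6 are discarded.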


In the next Section, a rule of thumb is suggested
for setting the tolerance value appearing in the
approximation. Note that larger tolerances result
in more discarded terms and hence less
computational effort and memory burden at a cost
of less accuracy. This tradeoff is also examined
briefly in the next Section.


3. Results

In this Section, we briefly summarize some
numerical results for a 50-state model of a
fault-tolerant system. The overall system is
assumed to comprise an actuator subsystem and a
sensor subsystem. These two subsystems are
identical in their redundant architecture and
their RM logic but are completely independent
otherwise. The Markov model for one subsystem is
shown in Figure 2 where D represents a correct
detection of a failure, D̄ represents a "missed"
detection, D̃ represents a false detection, I
represents the isolation of a failure following a
detection, Ī represents no isolation following a
detection, and Ĩ represents the isolation of the
wrong component following a detection. Table 1
lists the values of the conditional probabilities
of these events for each time step that were
assumed. The actuator subsystem was assumed to
consist of components whose mean time to failure
was 25 hours. The sensor subsystem components
were assumed to have a mean time to failure of 100
hours. The time step, which corresponds in such
models to the time between successive failure
detection tests, was assumed to be 1 second. The
performance associated with occupancy of each of
the states of the model was based in the case of
the sensors upon the achievable accuracy of the
estimation of a three-dimensional quantity
measured by the sensor array. A failed sensor was
assumed to produce a measurement with an
additional error of 3σ relative to a good sensor
where σ is the standard deviation of the random
error in the measurement from one sensor. The
actuator performance values were scaled up from
the sensor performance values to reflect the
increased importance to a control system of the
actuators. Details on the model construction can
be found in [9].

When the two independent models are combined, the
overall system model consists of 49 operational
states plus a SL state for a total of 50 states.
Of course, in this particular case there is no
need to combine the subsystem models into an
overall model in light of their independence.
However, we do so here in order to demonstrate the
applicability of our method to large models, which
are typical in the field of fault-tolerant system
performance evaluation.

The results described here were generated on a
modified Hewlett-Packard 982SU microcomputer. The
major limitation was the limited amount of memory
available for use. As a result, results could
only be generated for the 50-state model up to 111
time steps when the tolerance was very small. It
should be noted, however, that in 111 time steps
as many as 30,000 OSHs must be kept track of even
after merging those that have the same ending
states and same performance values. A computer
with virtual memory allows for much longer runs.
Nevertheless, the characteristics exhibited by the
results after 111 time steps are sufficient to
illustrate the insight that can be gained from a
performance evaluation tool.

Figures 3, 4 and 5 illustrate the effect of the
tolerance level in the approximation on the
results. Each is a plot of the computed
performance pmf after 150 time steps for a 7-state
model which is similar in scope to the 8-state
model for each of the subsystems described above.
The probability axis (vertical) on each plot is
logarithmic. The point at which the tolerance
begins to have a profound effect on the results is
at a tolerance level between 10^-3 and 10^-2. The
total expected performance for this system at 150
time steps (i.e. the expected value of the
performance accumulated by OSHs up to this time
point without regard to whether or not they have
reached the SL state) is approximately 52. Hence,
the performance pmf results begin to break down when
the tolerance reaches a value approximately 4
decades below the total expected performance for
the mission (which can be calculated easily by the
Markov process with rewards result). In all of
the results generated in this study so far, this
has been a good rule of thumb: Set the tolerance
at least 4 decades below the total expected
performance to avoid inaccurate results for the
approximate performance pmf.

Returning to the 50-state model, the value of its
expected performance at 350 time steps is 4560.
By the rule of thumb above then, the tolerance
should be set no larger than 0.4 to generate
reasonably accurate results for the performance
pmf. Figure 6 is the performance pmf for the
50-state model after 350 time steps using a
tolerance of 0.1. Note that this value of the
tolerance has allowed propagation of the
v-transform matrix to a number of steps at which
as many as 100,000 different OSHs would have to be
kept track of were it not for the approximation.
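The rule of thumb reduces to one line of arithmetic. The helper below is a hypothetical convenience function, not part of the paper; it simply divides the total expected mission performance by ten to the stated number of decades.

```python
def max_tolerance(total_expected_performance, decades=4):
    """Largest tolerance suggested by the rule of thumb: the stated number of
    decades below the total expected performance for the mission."""
    return total_expected_performance / 10 ** decades


limit_50_state = max_tolerance(4560)   # 50-state model at 350 time steps
```

For the 50-state model this gives 0.456, in line with the bound of about 0.4 quoted in the text, and the tolerance of 0.1 used for Figure 6 lies below it.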

Further results for these models are given in [9],
which also uses the modal decomposition of the
result from the theory of Markov processes with
rewards to generate a reduced order model that
approximates the performance behavior of the
50-state model.


4.  Brief  Discussion  of  Other  Work 

A related research effort is currently exploring
another avenue toward the generation of
approximate performance evaluation results for
fault-tolerant systems. This work exploits the
separation in time scales of the failure behavior
of components and the behavior of the tests used
to detect and isolate those failures. In
particular, a well-designed fault-tolerant system
includes a failure detection mechanism which
detects and isolates failures very quickly. On
the other hand, the failures themselves tend to
occur only rarely and are therefore considerably
spread out in time. If a finite state Markov
model or semi-Markov model is constructed to
represent the behavior of the fault-tolerant
system, then it oftentimes naturally decomposes
into classes of states. The states within each
class are such that the transitions between them
are frequent with small holding times as
determined by the failure detection decision
processes. Meanwhile, the transitions between the
classes are governed by the failure processes and
are therefore much slower and less frequent. Once
the system model is decomposed in this fashion, it
is almost in a form to which some recent results
from the theory of Markovian processes with rare
events can be applied. However, for
fault-tolerant system models, there remain a few
difficulties. Overcoming these difficulties is
the subject of our current efforts.


5. Conclusion

In this paper, we have briefly described some
approximate techniques for evaluating the
statistical properties of the performance of
fault-tolerant control systems. As such systems
come into wider use, the availability of design
tools based upon performance evaluation techniques
will be increasingly important. The method
described here circumvents the difficulty of
dimensionality encountered by straightforward
combinatorial and Markovian techniques by
introducing the v-transform representation and
then using it to suggest an approximate
simplification which increases considerably the
efficiency of the performance evaluation algorithm
for large-scale models without sacrificing
significant accuracy. Some numerical results
illustrate a rule of thumb for using the algorithm
and illustrate some of the useful performance
properties that result.